ABSTRACT

A new approach to assessing unexpected differential
item performance (item bias or item fairness) is introduced and
applied to the item responses of different subpopulations of
Scholastic Aptitude Test (SAT) takers. The essential features of the
standardization approach are described. The primary goal of the
standardization approach is to control for differences in
subpopulation ability before making comparisons between subpopulation
performance on test items. By so doing, it removes the contaminating
effects of ability differences from the assessment of item fairness.
The approach is capable of identifying rare individual instances
(outliers) of unexpected differential item performance (that can
sometimes be attributed to unfair content), as well as differences on
groups of items which might be attributed to the fact that these
items are measuring different attributes in different subpopulations.
Precis

A new approach to assessing unexpected differential item performance (item bias
or item fairness) is introduced and applied to the item responses of different
subpopulations of Scholastic Aptitude Test (SAT) takers. The essential features of
the standardization approach are described. The primary goal of the standardization
approach is to control for differences in subpopulation ability before making
comparisons between subpopulation performance on test items. By so doing, it removes
the contaminating effects of ability differences from the assessment of item
fairness. The approach is shown to be capable of identifying rare individual
instances (outliers) of unexpected differential item performance (that can sometimes
be attributed to unfair content), as well as differences on groups of items which
might be attributed to the fact that these items are measuring different attributes
in different subpopulations.
The Standardization Approach to Assessing
Unexpected Differential Item Performance



In recent years much attention has been directed to the issue of fairness
in educational and psychological tests. At Educational Testing Service (ETS),
those who develop and review the Scholastic Aptitude Test/Test of Standard
Written English (SAT/TSWE) are aware of the diversity of the test-taking
population and attempt to construct tests based on a broad sampling of tasks
and topics that tend not to favor any subgroup of the population. In
addition, there are a number of procedures which ETS has instituted (Donlon,
1981), including sensitivity reviews and statistical checks, in order to guard
against possible favoritism on the SAT towards any subgroup. Nevertheless,
despite these efforts, the importance and complexities inherent in the nature
of item fairness necessitate post hoc investigation to evaluate the
effectiveness of these safeguarding procedures. This paper summarizes the
findings of four studies that used the statistical method of standardization
to examine whether there are unexpected differences in item performance across
different subpopulations of the SAT test-taking population. In addition, a
brief introduction to the standardization method is presented.

Standardization Methodology 

An item is exhibiting unexpected differential item performance when the
probability of correctly answering the item is lower for examinees from one
group than for examinees of equal ability from another group or groups. This
definition may be formalized mathematically by letting $S$ represent ability as
measured by total score on the standard College Board 200-to-800 SAT scale (or
on the 20-to-60 TSWE scale), and $X$ represent an item score (1 if the answer to
the question is correct and 0 if the answer is incorrect). An item, then, is
free of unexpected differential item performance when it satisfies the
following equality,

    $P_g(X=1 \mid S) = P_{g'}(X=1 \mid S)$ for all subpopulations $g$ and $g'$,

where $P_g(X=1 \mid S)$ is defined as the probability that candidates from
subpopulation $g$ who have total test scores equal to $S$ will answer the item
correctly. For example, if male and female candidates with the same total
test scores do not have equal probabilities of successful performance on the
item, this difference in probabilities is taken as evidence of unexpected
differential item performance for male and female candidates at this score
level. Note that a lack of unexpected differential item performance does not
imply that there will not be any observed differences in item performance
across subgroups of the SAT candidate population, but that there are no
differences in conditional item performance across subgroups when the
requisite condition before comparison is identical total test score. The
reference to this type of differential performance as "unexpected" is
purposeful, in order to emphasize that the focus ought to be on differences
between candidates of equal score level, among whom one would not expect to
find any differences. This represents an important distinction from observed
differences in item performance between groups of differing ability, where some
differences are of course expected.
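The distinction between observed and conditional differences can be made concrete with a small numerical sketch. The Python example below uses invented numbers (they are not taken from any of the studies summarized here): two hypothetical groups share the same conditional probability of success at every score level, yet their overall percent correct differs because their score distributions differ. The group names and values are assumptions for illustration only.

```python
# Hypothetical illustration: identical conditional probabilities at each
# score level, but different score distributions in the two groups.

# Assumed P(X=1 | S) shared by both groups (invented values).
p_correct_given_score = {400: 0.40, 600: 0.80}

# Assumed relative frequencies of each score level within each group.
score_dist_group_a = {400: 0.75, 600: 0.25}
score_dist_group_b = {400: 0.25, 600: 0.75}


def overall_percent_correct(score_dist, p_given_score):
    """Marginal proportion correct: sum over s of P(S=s) * P(X=1 | S=s)."""
    return sum(score_dist[s] * p_given_score[s] for s in score_dist)


overall_a = overall_percent_correct(score_dist_group_a, p_correct_given_score)
overall_b = overall_percent_correct(score_dist_group_b, p_correct_given_score)

# The observed (marginal) difference is nonzero even though the conditional
# probabilities are equal at every score level.
observed_difference = overall_a - overall_b
```

In this sketch the observed difference reflects only the difference in ability distributions, which is the kind of expected difference the definition sets aside.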

Previous methods used to appraise unexpected differential item performance
typically have been hampered by sensitivities to differences in overall
subpopulation ability or differences in item quality (discrimination). The
standardization methodology, however, controls for differences in both
subpopulation ability and in item quality. Standardization is used here to
mean that differences on one variable have been controlled for prior to making
comparisons between groups on some other related variable. A general approach
to assessing unexpected differences in item performance via standardization is
described in detail in Dorans and Kulick (1983). The essential features of
the method as applied to the SAT are as follows: Using the standard College
Board 200-800 SAT scale one can establish 61 individual ability levels (200,
210, 220, etc.). The probability that an examinee at a given ability level
will correctly answer an item can be estimated by the observed percent correct
among those with the given scaled score. Studies of unexpected differential
item performance focus on differences between two or more groups. One group
is arbitrarily designated as the base group. The base group is used to
estimate the conditional probability of successful item performance given
score level. Usually the group that provides the most stable estimates of the
conditional probabilities across the entire scaled score range is selected as
the base group. Typically, but not always, this is the largest group. The
remaining groups are referred to as study groups or comparison groups.

Several indices used in the standardization process may be defined. $P_b$ is
the overall percent correct in the base group for an item. $P_{bs}$ is the
percent correct at ability level $s$ in the base group. $P_g$ is the overall
percent correct in the study group. $P_{gs}$ is the percent correct at ability
level $s$ in the study group. $P_b$ and $P_g$ are not directly comparable when
the base group and study group have different marginal ability distributions.
It is necessary to calculate the expected item performance of the study group,
$\hat{P}_g$. $\hat{P}_g$ is computed by taking a weighted sum of the 61
conditional probabilities of successful item performance observed in the base
group, $P_{bs}$, where the relative frequencies at each of the 61 scaled score
levels in a designated group serve as the weights. The designated group that
supplies the frequency distribution to be used as weights is referred to as
the standardization group. Having the study group also serve as the
standardization group (as was done in the four studies presented here) ensures
that the most important conditional probabilities are weighted most heavily,
i.e., conditional probabilities at those score levels most attained by the
study group.
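The computation of the expected study-group performance can be sketched in a few lines of Python. The counts below are invented and cover only three score levels rather than the 61 on the SAT scale; the function and variable names are hypothetical, and the sketch assumes the study group serves as the standardization group and that the base group has examinees at every score level used.

```python
# Hypothetical counts per scaled-score level: (examinees, number correct).
# Only three levels are shown; the SAT scale described above has 61.
base_counts = {400: (100, 40), 500: (200, 120), 600: (100, 80)}
study_counts = {400: (50, 15), 500: (40, 24), 600: (10, 8)}


def conditional_proportions(counts):
    """Observed proportion correct at each score level."""
    return {s: correct / n for s, (n, correct) in counts.items()}


def relative_frequencies(counts):
    """Relative frequency of each score level within a group (the weights)."""
    total = sum(n for n, _ in counts.values())
    return {s: n / total for s, (n, _) in counts.items()}


def overall_proportion(counts):
    """Overall proportion correct in a group."""
    total = sum(n for n, _ in counts.values())
    return sum(correct for _, correct in counts.values()) / total


p_bs = conditional_proportions(base_counts)
weights = relative_frequencies(study_counts)  # study group = standardization group

# Expected study-group performance: weighted sum of base-group conditionals.
expected_p_g = sum(weights[s] * p_bs[s] for s in weights)

p_b = overall_proportion(base_counts)   # overall percent correct, base group
p_g = overall_proportion(study_counts)  # overall percent correct, study group
```

With these invented counts the base group's overall proportion correct (0.60) is well above the study group's (0.47), but the fair comparison is between the study group's observed value and its expected value, which is what the weighting provides.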

The most precise measure of differential item performance is at the
individual scaled score level, $D_{gs} = P_{gs} - P_{bs}$. These differences
can be combined across score levels in a variety of ways to obtain a number of
summary indices of unexpected differential item performance. Plots of these
differences, as well as plots of $P_{gs}$ and $P_{bs}$, are helpful to
visualize the quantification of unexpected differential item performance (see
Figures 1-4). Figures 1 and 2 depict an item that is performing fairly for
both groups. Figures 3 and 4 portray an item that is unexpectedly difficult
for females. The top figures (1 and 3) present the conditional probabilities
of successful item performance for males and females. These curves may also
be thought of as nonparametric item-test regressions or empirical item
characteristic curves. The lower figures (2 and 4) are simply plots of the
group differences observed above.



One of the most informative indices summarizing these differences is the
root mean weighted squared difference ($RMWSD_g$). The $RMWSD_g$ for an item
is obtained by squaring each difference in conditional probabilities of
successful item performance between the study and base groups, $D_{gs}$,
taking a weighted sum of these squared differences, and taking the square root
of the weighted sum, where the relative frequency distribution of the
standardization group serves as the weighting function. Since this index is
unsigned, any difference produces a positive discrepancy. Consequently, every
item will have a non-negative value of $RMWSD_g$. An item exhibiting
substantial unexpected differential item performance will have a large
$RMWSD_g$. An item exhibiting absolutely no unexpected differential item
performance will have a $RMWSD_g$ equal to zero.
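The verbal description of the root mean weighted squared difference translates directly into code. The following Python sketch is one way to write it, under the assumption that the weights are relative frequencies summing to one; the function name `rmwsd` and the three-level data are invented for illustration.

```python
import math


def rmwsd(p_study, p_base, weights):
    """Root mean weighted squared difference for one item.

    p_study, p_base: proportion correct at each score level.
    weights: relative frequencies of the standardization group (sum to 1).
    """
    weighted_sum = sum(
        weights[s] * (p_study[s] - p_base[s]) ** 2 for s in weights
    )
    return math.sqrt(weighted_sum)


# Invented three-level example (the SAT scale has 61 levels).
p_base = {400: 0.40, 500: 0.60, 600: 0.80}
p_study = {400: 0.30, 500: 0.60, 600: 0.80}
weights = {400: 0.50, 500: 0.40, 600: 0.10}

item_rmwsd = rmwsd(p_study, p_base, weights)

# Identical conditional proportions give an index of exactly zero.
no_difference = rmwsd(p_base, p_base, weights)
```

Because each difference is squared before weighting, differences of opposite sign at different score levels cannot cancel in this index.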

The difference ($D_g$) between $P_g$ and $\hat{P}_g$, ($D_g = P_g - \hat{P}_g$),
is another index of unexpected differential item performance. If there is no
unexpected differential item performance between the study group and base
group, $D_g$ should equal zero. A positive $D_g$ indicates that the study
group exceeds its expected performance, while a negative $D_g$ indicates that
the item is harder than expected for the study group.
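When the study group also serves as the standardization group, as in the four studies summarized here, this overall difference can be written as a weighted average of the score-level differences. The short derivation below assumes the notation $w_s = N_{gs}/N_g$ for the relative frequency of study-group candidates at score level $s$; the symbols $w_s$, $N_{gs}$ and $N_g$ are introduced here for illustration, and $\hat{P}_g$ denotes the expected study-group performance.

```latex
\begin{aligned}
P_g &= \sum_s w_s P_{gs}, \qquad
\hat{P}_g = \sum_s w_s P_{bs}, \qquad
w_s = \frac{N_{gs}}{N_g}, \\
D_g &= P_g - \hat{P}_g
     = \sum_s w_s \,(P_{gs} - P_{bs})
     = \sum_s w_s D_{gs}.
\end{aligned}
```

Because $D_g$ is signed, positive and negative score-level differences can offset one another, so a value near zero does not by itself rule out differences at individual score levels.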

A problem faced by any investigation which seeks to detect and quantify
unexpected differential item performance, regardless of methodology, is the
determination of what level of unexpected differential item performance should
evoke concern. In the first report using the standardization approach (Dorans
and Kulick, 1983), an empirical determination was made concerning the
practical cutoff point for values of $RMWSD_g$ using frequency distributions
of the $RMWSD_g$ index. According to this determination, an item with a
$RMWSD_g$ greater than or equal to .08 merits careful investigation, while an
item with a $RMWSD_g$ less than .08 does not require additional study. Items
with $RMWSD_g$ greater than or equal to .16 are exhibiting clearly
unacceptable levels of differential performance. Figure 5 presents a plot of
the $RMWSD_g$ index for a set of verbal items. The value of $RMWSD_g$ equals
the distance from the origin to the point representing the item. Projection
of each point on the horizontal axis yields $D_g$ for that item. Most of the
items in this figure fall within the smallest arc. One item, however, can be
seen falling outside the second arc. This is clearly an outlier exhibiting a
high level of unexpected differential item performance.
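The cutoffs and the geometry of the Figure 5 plot can be illustrated with a short Python sketch. It assumes that the study group is the standardization group and that the vertical coordinate described in the note to Figure 5 is the weighted standard deviation of the score-level differences about their weighted mean; under those assumptions the squared distance from the origin is the sum of the squared horizontal and vertical coordinates. The category labels, function names and data are hypothetical.

```python
import math

# Cutoffs reported in the text; the category labels are illustrative names.
INVESTIGATE_CUTOFF = 0.08
UNACCEPTABLE_CUTOFF = 0.16


def classify_rmwsd(value):
    """Map an RMWSD value to one of three illustrative categories."""
    if value >= UNACCEPTABLE_CUTOFF:
        return "clearly unacceptable"
    if value >= INVESTIGATE_CUTOFF:
        return "merits careful investigation"
    return "no additional study required"


def plot_coordinates(p_study, p_base, weights):
    """Return (D_g, weighted SD of differences, distance from origin).

    Assumes the weights are the study group's relative frequencies, so the
    weighted mean of the score-level differences equals D_g.
    """
    diffs = {s: p_study[s] - p_base[s] for s in weights}
    d_g = sum(weights[s] * diffs[s] for s in weights)
    variance = sum(weights[s] * (diffs[s] - d_g) ** 2 for s in weights)
    sd = math.sqrt(variance)
    return d_g, sd, math.hypot(d_g, sd)


# Invented three-level example (the SAT scale has 61 levels).
p_base = {400: 0.40, 500: 0.60, 600: 0.80}
p_study = {400: 0.30, 500: 0.60, 600: 0.80}
weights = {400: 0.50, 500: 0.40, 600: 0.10}

d_g, sd, distance = plot_coordinates(p_study, p_base, weights)
label = classify_rmwsd(distance)
```

In this invented example the distance from the origin is about .07, just inside the .08 arc, so the item would not be flagged even though it is harder than expected for the study group at the lowest score level.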

Results Using the Standardization Method 

Four studies have been completed to date employing the standardization
approach to item bias. The findings from these studies are briefly summarized
below.

The first investigation compared the performance of male and female
candidates on a form of the SAT administered in 1977. Essentially there was
very little evidence of unexpected differential item performance. Figure 5
shows the distribution of $RMWSD_g$ values on the verbal test. A few items are
in the region where they should be examined more closely, but the most
striking feature of the plot is the analogy outlier. Clearly this item is
exhibiting an unacceptable level of unexpected differential item performance.
This same item is portrayed in Figures 3 and 4. Notice the largest
differences are at the lower to middle portion of the scaled score range,
where the majority of the candidates are. Examination of this item revealed
that a certain knowledge of hunting and fishing is required to answer
correctly. It should be noted that this form of the SAT was developed prior
to the institution of formal sensitivity reviews.

The second study divided the candidate population into three subgroups
based on reported level of father's education, a variable related to
socioeconomic status. The education levels defining the first study group,
second study group, and base group were: less than high school degree, high
school degree but less than bachelor's degree, and bachelor's degree or
higher, respectively. Thus each item was evaluated twice, once with respect
to each study group, while maintaining the same base group.

Examination of discrepancy index summary statistics revealed that there
was little evidence of systematic unexpected differential item performance by
either study group on SAT-V, SAT-M or TSWE. The same conclusion was reached
by inspection of frequency distributions and plots of item discrepancy indices
such as the one in Figure 6. The results of this study seem to indicate that
the items on the SAT and TSWE forms used in this study are equally appropriate
for all candidates regardless of father's level of education.

The third study divided the candidate population into two subgroups based
on reported answers to a racial/ethnic background question. The Oriental
group (including Asian Americans and Pacific Islanders as well) was designated
as the study group, while the White (or Caucasian) group served as the base
group. Whereas studies I and II had found few or no outliers, this
investigation detected 52 (out of 195) items which displayed questionable
levels of unexpected differential item performance. Figure 7 indicates
clearly that unexpected differential item performance between Oriental and
White candidates was rather widespread on this particular mathematical test
form. Similar plots were observed for SAT-V and TSWE.

Two factors were identified which may help account for the abundance of
items identified: 1) since a sizeable percentage of the Oriental group
reported that English is not their best language, it was suggested that items
covering verbal skills which this subgroup had not mastered would appear
unduly difficult for them; and 2) the sample size obtained for the Oriental
group may have been too small to accurately estimate conditional percent
correct. The language hypothesis was tested on the mathematics set of items.
A test developer independently divided the math items into categories of
"verbally-loaded" math items, "pure" math items, and "neutral" math items.
Analysis of the discrepancy indices on items in each category supported the
explanation we proposed, as the "verbally-loaded" category had the most
unexpectedly difficult items for the Oriental group, while the "pure" math
category had the most unexpectedly easy items for the Oriental group. The
effects of small sample size combined with the heterogeneous composition of
the Oriental sample on the non-parametric item-test regression curves are
apparent in Figure 8. Observe the erratic pattern of stars in this plot.

This study demonstrates that in situations where the test becomes
multidimensional for one of the groups, the scaled score may not be an
effective control variable. These results suggest that further investigations
of SAT/TSWE items need to be done where the Oriental group is restricted to
those for whom English is the best language.

The fourth and final study divided the candidate population into two
subgroups based on reported answers to a racial/ethnic background question.
The Black group was designated as the study group, while the White group
served as the base group. Examination of discrepancy index summary statistics
at the item type level revealed an interesting finding: Analogy type items
appeared to be unexpectedly more difficult for Blacks than for Whites. Since
this result is consistent with previous research on the SAT (see Dorans (1982)
for a review), and is not readily explainable, it suggests the need for
additional research in order to determine possible factors or characteristics
of the analogy type items which may be related to ethnicity. Further analyses
revealed that the test, as a whole, was relatively free from unexpected
differential item performance between Blacks and Whites. Most evidence of
unexpected differential item performance was limited to a few items, and only
one of these exhibited a clearly unacceptable level. The non-parametric
item-test regressions for this item (and their differences) are presented in
Figures 9 and 10. Inspection of the item content provided no insight to
account for the differential performance observed on the item. Additional
analyses and examination of the item by test development staff are
recommended.

In sum, the standardization method seems to be an effective means of
comparing the item performance of groups who differ greatly in ability. Its
major drawback is probably the large sample sizes that it requires, but for
its current application to the SAT/TSWE this is not a serious weakness.
Furthermore, the visual displays that it provides, both at the item and test
level, are valuable aids to data interpretation.
Figures 1 and 3

Conditional Probabilities of Successful Item Performance
for Males and Females on Two Verbal Items from SAT Form ZSA5

Figures 2 and 4

Difference Plots of Two Verbal Items from SAT Form ZSA5
Figure 5

Plot of Root Mean Weighted Squared Differences ($RMWSD_g$) Between
the Conditional Probabilities of Success for Male and Female
Candidates on Verbal Items from SAT Form ZSA5

$RMWSD_g$ equals the distance from the origin to the point representing the
item. Projection of each point on the horizontal axis yields the difference
between $P_g$ and $\hat{P}_g$, $D_g$, for that item. Projection of each point
on the vertical axis yields the standard deviation of the weighted
differences, an index of residual crossover.
Figure 6

Plot of Root Mean Weighted Squared Differences ($RMWSD_g$)
Between the Conditional Probabilities of Success for
Study Group 2 and the Base Group on Verbal Items from
SAT Form CSA2
Figure 7

Plot of Root Mean Weighted Squared Differences ($RMWSD_g$)
Between the Conditional Probabilities of Success for
Orientals and Whites on Verbal Items from SAT Form GSA6
Figure 8

Example of Variability in the Conditional 
Probabilities of Successful Item Performance for 
Orientals on a Math Item from SAT Form CSA6 
Figure 9

Conditional Probabilities of Successful Item Performance for
Blacks and Whites on Analogy Item from SAT Form CSA6



[Plot: ANALOGY item. Vertical axis: percent correct. Horizontal axis: scaled score (200 to 800). Legend: Black, White.]



Figure 10 

Differences Between Conditional Probabilities 